First, I import the libraries I need, along with the code extracted from my previously created R Markdown from Portfolio 1.

include <- function(library_name){
  #Install the package if it is not already installed, then attach it quietly.
  if( !(library_name %in% rownames(installed.packages())) )
    install.packages(library_name)
  suppressMessages(library(library_name, character.only=TRUE))
}

include("tidyverse")
include("knitr")
include("stringr")
include("caret")
include("rvest")
include("DT")
suppressMessages(purl("linkedin.Rmd", output = "part1.r"))
## [1] "part1.r"
source("part1.r")

Intro 1st Dataset (Webscrape)

Here, I will be scraping a website called Seek (https://www.seek.com.au/), which appears to be the largest job-search site posting jobs in Australia. The scrape will collect each posting's position, location, description, and pay. I also wanted to grab the company name, but some companies keep themselves private, so their names sit under a different tag and could not be scraped reliably; I have left that field out due to the difficulty.

read_jobs <- function(x)
{
  #Create an empty tibble to hold the scraped data.
  scrape <- tibble(position= as.character(),
                   location= as.character(),
                   description= as.character(),
                   pay= as.character()
                   )
  
  #Loop through pages 0-200, since 200 is the maximum number of pages Seek provides in its search for all jobs. 
  for(i in 0:200)
  {
      url <- paste(x, i, sep = "")
      
      site <- read_html(url)
  
      #The outermost node for each job card; everything else is scraped relative to it. 
      data <- site %>%
            html_nodes("article._2m3Is-x")
      
      #Grabbing pay data. Using html_attr because it fills with NA when the attribute is missing, keeping the vectors aligned.
      pay <- data %>%
        html_nodes("div.xxz8a1h > span:nth-child(3)") %>%
        html_attr("aria-label")
      
      #Grabbing position data. 
      position <- data %>%
        html_attr("aria-label")
      
      #Doing some initial cleaning to remove unnecessary information other than the position.
      position <- gsub("-.*", "", position)
      
      #Location's grandparent node. 
      location_grand <- data %>%
        html_nodes("span._3FrNV7v")
      
      #Location's parent node
      location_parent <- location_grand %>%
        html_nodes("strong.lwHBT6d")
      
      #Grabbing location data
      location <- location_parent %>% 
        html_nodes(".Eadjc1o") %>%
        html_text() 
      
      #Initial cleaning of the location by removing the 'location: ' prefix. 
      location <- gsub("location: ", "", location)
      
      #There are 2 nodes under the .Eadjc1o class per card, and every odd node is the
      #location, so keep those and drop the rest (the length-2 logical index recycles). 
      location <- location[c(TRUE, FALSE)]
      
      #Grabbing description nodes. 
      description <- data %>%
        html_nodes(".bl7UwXp") %>%
        html_text()
      
      #Company grandparent's node. 
      company_grand <- site %>%
        html_nodes("article._2m3Is-x > span:nth-child(5)")

      #Company's parent node.
      company_parent <- company_grand %>%
        html_nodes("span")

      #Grabbing company nodes. 
      company <- company_parent %>%
        html_nodes("a._3AMdmRg") %>%
        html_attr("aria-label")

      #Creating a tibble with each of the attributes per page. 
      table <- tibble(position= position,
                      location= location,
                      description= description,
                      pay= pay
                      )
    
      #Combining the tibble with the final tibble. 
      scrape <- rbind(scrape, table)
      
  }
  #Return the final tibble with all of the scraped data.
  return(scrape)
}


scraped <- read_jobs("https://www.seek.com.au/jobs?page=")
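
The `location[c(TRUE, FALSE)]` step inside `read_jobs` relies on R recycling a length-2 logical index along the whole vector, which keeps every odd element. A tiny sketch with made-up values:

```r
# Toy vector alternating location / work-type entries (made-up values).
x <- c("Sydney", "Full time", "Perth", "Contract", "Hobart", "Part time")

# The length-2 logical index recycles, keeping elements 1, 3, 5, ...
odd_elements <- x[c(TRUE, FALSE)]
```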

Now that we have scraped the data, we need to do some more cleaning.

#Extract the string where yearly salary is posted then remove the commas. 
scraped$pay <- str_extract(scraped$pay, "\\d+,\\d+")
scraped$pay <- as.numeric(gsub(",","",scraped$pay))

#Remove rows where pay is NA.
scraped <- scraped[!is.na(scraped$pay), ]
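
To sanity-check the extraction, here is the same `str_extract`/`gsub` pipeline applied to a few hypothetical pay labels (made-up strings in the format Seek's aria-label tends to use); postings without a numeric salary fall out as NA:

```r
library(stringr)

# Hypothetical pay labels (made-up examples).
pay_raw <- c("$80,000 - $90,000 per year", "Attractive salary", "$120,000 package")

# Take the first comma-separated number (the lower bound of a range),
# drop the comma, and convert to numeric; non-matches become NA.
pay_num <- as.numeric(gsub(",", "", str_extract(pay_raw, "\\d+,\\d+")))
```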

#Make the location column a factor and list its levels.
scraped$location <- as.factor(scraped$location)
levels(scraped$location)
##  [1] "ACT"                                 
##  [2] "Adelaide"                            
##  [3] "Asia Pacific"                        
##  [4] "Bendigo, Goldfields & Macedon Ranges"
##  [5] "Blue Mountains & Central West"       
##  [6] "Brisbane"                            
##  [7] "Bundaberg & Wide Bay Burnett"        
##  [8] "Cairns & Far North"                  
##  [9] "Gold Coast"                          
## [10] "Gosford & Central Coast"             
## [11] "Hobart"                              
## [12] "Kalgoorlie, Goldfields & Esperance"  
## [13] "Katherine & Northern Australia"      
## [14] "Lismore & Far North Coast"           
## [15] "Melbourne"                           
## [16] "Newcastle, Maitland & Hunter"        
## [17] "Perth"                               
## [18] "Richmond & Hawkesbury"               
## [19] "South West Coast VIC"                
## [20] "Southern Highlands & Tablelands"     
## [21] "Sunshine Coast"                      
## [22] "Sydney"                              
## [23] "Toowoomba & Darling Downs"           
## [24] "Wagga Wagga & Riverina"              
## [25] "Wollongong, Illawarra & South Coast"
#Since some factor levels have only one or two postings, we will keep only the locations with at least 3.
table(scraped$location)
## 
##                                  ACT                             Adelaide 
##                                   10                                    9 
##                         Asia Pacific Bendigo, Goldfields & Macedon Ranges 
##                                    2                                    1 
##        Blue Mountains & Central West                             Brisbane 
##                                    1                                   39 
##         Bundaberg & Wide Bay Burnett                   Cairns & Far North 
##                                    1                                    2 
##                           Gold Coast              Gosford & Central Coast 
##                                    2                                    1 
##                               Hobart   Kalgoorlie, Goldfields & Esperance 
##                                    2                                    1 
##       Katherine & Northern Australia            Lismore & Far North Coast 
##                                    2                                    2 
##                            Melbourne         Newcastle, Maitland & Hunter 
##                                  143                                    2 
##                                Perth                Richmond & Hawkesbury 
##                                   17                                    1 
##                 South West Coast VIC      Southern Highlands & Tablelands 
##                                    1                                    1 
##                       Sunshine Coast                               Sydney 
##                                    2                                  271 
##            Toowoomba & Darling Downs               Wagga Wagga & Riverina 
##                                    1                                    2 
##  Wollongong, Illawarra & South Coast 
##                                    1
scraped <- subset(scraped, location %in% c("ACT", "Adelaide", "Brisbane", "Melbourne", "Perth", "Sydney"))

#Render the table as an interactive HTML datatable.
datatable(scraped, options=list(pageLength=5))
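
An alternative to hard-coding the six city names is to derive the kept levels from the counts themselves. A minimal sketch on a toy data frame (made-up rows) that keeps any location with at least 3 postings:

```r
# Toy data frame mirroring the scraped tibble's columns (made-up rows).
df <- data.frame(location = c("Sydney", "Sydney", "Sydney",
                              "Perth", "Perth", "Perth", "Hobart"),
                 pay = c(90000, 85000, 70000, 88000, 91000, 79000, 60000))

# Count postings per location and keep only levels appearing at least 3 times.
counts <- table(df$location)
keep <- names(counts)[counts >= 3]
df_kept <- df[df$location %in% keep, ]
```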

# Analysis

Now, we can start investigating. I want to see whether we can predict pay from where a position is located. The model will probably be weak, since there are not many data points and different kinds of jobs pay very different amounts, but we will not know until we try. Let's begin!

#Randomly pick 70% of the data.
sample_selection <- createDataPartition(scraped$pay, p=0.70, list=FALSE)

#Splitting the data: 70% goes into our train set and 30% into our test set. 
train = scraped[sample_selection, ]
test = scraped[-sample_selection, ]
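
Note that `createDataPartition` samples randomly (while also stratifying on the outcome), so the split, and the model numbers below, will vary from knit to knit unless the RNG is seeded. A minimal base-R sketch of a reproducible, unstratified 70/30 split:

```r
set.seed(42)  # fix the RNG so the split is identical on every run

n <- 100                                        # toy number of rows
train_idx <- sample(n, size = floor(0.70 * n))  # 70 indices for training
test_idx  <- setdiff(seq_len(n), train_idx)     # remaining 30 for testing
```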

#Linear model of our dependent variable (pay) on the independent variable (location).
train_model <- lm(pay ~ factor(location), data = train)

summary(train_model)
## 
## Call:
## lm(formula = pay ~ factor(location), data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -77862  -25062  -10062    4938 1719938 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                75000.0    43870.3   1.710   0.0883 .
## factor(location)Adelaide  -13226.0    59400.7  -0.223   0.8239  
## factor(location)Brisbane     409.5    47760.0   0.009   0.9932  
## factor(location)Melbourne  -4132.2    44830.4  -0.092   0.9266  
## factor(location)Perth     -16188.5    50657.1  -0.320   0.7495  
## factor(location)Sydney      5062.0    44482.2   0.114   0.9095  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 98100 on 338 degrees of freedom
## Multiple R-squared:  0.003502,   Adjusted R-squared:  -0.01124 
## F-statistic: 0.2376 on 5 and 338 DF,  p-value: 0.9457
predictions <- train_model %>% predict(test)

#Let's plot it
ggplot(data = test, aes(x = predictions, y = pay)) +
  geom_point() +
  geom_smooth(method = "lm")

R2 <- R2(predictions, test$pay)
RMSE <- RMSE(predictions, test$pay)
MAE <- MAE(predictions, test$pay)

Looking at the summary of the train model, none of the location coefficients have significant p-values, so we cannot predict pay from the location of the position. This matches my initial hypothesis that the model would be weak due to the small amount of data and the wide variety of job types paying at different rates.

Intro to 2nd dataset

This dataset fits nicely with my first dataset because it contains metrics calculated from people's profile pictures, including a beauty score. Initially, I did not know how to interpret it since there was no documentation. I eventually reached out to the creator of the dataset, Andrew Truman (https://www.linkedin.com/in/kbot/), to ask him a few questions about it. He explained what all of the metrics meant and what units they were in. He also explained that he used a service called Face++ (https://www.faceplusplus.com/) to examine profile pictures, which uses machine learning to assign numerical values to them. This dataset has the following attributes.

| Variable | Type | Description |
|----------|------|-------------|
| X | int | Index of each profile |
| avg_n_pos_per_tenure | int | Average number of positions per tenure |
| avg_pos_len | int | Average position length |
| avg_prev_tenure_len | int | Average previous tenure length |
| c_name | character | Company name |
| n_pos | int | Number of positions |
| n_prev_tenures | int | Number of previous tenures |
| tenure_len | int | Tenure length |
| age | int | Age estimate |
| beauty | int | Beauty estimate |
| beauty_female | int | Beauty estimate as female |
| beauty_male | int | Beauty estimate as male |
| blur | int | Blur level estimate |
| blur_gaussian | int | Another blur level estimate |
| blur_motion | int | Blur motion estimate |
| emo_anger | int | Anger level estimate |
| emo_disgust | int | Disgust level estimate |
| emo_fear | int | Fear level estimate |
| emo_happiness | int | Happiness level estimate |
| emo_neutral | int | Neutral level estimate |
| emo_sadness | int | Sadness level estimate |
| emo_surprise | int | Surprise level estimate |
| ethnicity | string/categorical | Ethnicity estimate |
| face_quality | int | Face quality estimate |
| gender | int | Gender estimate |
| glass | string/categorical | Dark, None, or Normal |
| head_pitch | int | Head pitch estimate |
| head_roll | int | Head roll estimate |
| head_yaw | int | Head yaw estimate |
| mouth_close | int | Mouth close estimate |
| mouth_mask | int | Mouth mask estimate |
| mouth_open | int | Mouth open estimate |
| mouth_other | int | Mouth other estimate |
| skin_acne | int | Skin acne estimate |
| skin_dark_circle | int | Skin dark circle estimate |
| skin_health | int | Skin health estimate |
| skin_stain | int | Skin stain estimate |
| smile | int | Smile estimate |
| african | int | African estimate |
| celtic_english | int | Celtic English estimate |
| east_asian | int | East Asian estimate |
| european | int | European estimate |
| greek | int | Greek estimate |
| hispanic | int | Hispanic estimate |
| jewish | int | Jewish estimate |
| muslim | int | Muslim estimate |
| nationality | string | Nationality estimate |
| nordic | int | Nordic estimate |
| south_asian | int | South Asian estimate |
| n_followers | int | Number of followers on LinkedIn |

Prediction Hypothesis

With this dataset, I want to explore how various work-history metrics can predict a person's beauty score (the dependent variable). First, we need to import the data and do some initial cleaning.

In this segment, I will look for any correlation between beauty and the other work-history metrics; I believe beauty may have some relationship with a person's work outcomes. Beauty will be the dependent variable, while average position length, average previous tenure length, and tenure length will be the independent variables. But first, we need to do some initial cleaning.

#Import dataset
linkedin2 <- read.csv("linkedin_data.csv")

#delete columns we don't need.
linkedin2$m_urn <- NULL
linkedin2$img <- NULL

#rename column to company_name instead for clarity.
colnames(linkedin2)[colnames(linkedin2)=="c_name"] <- "company_name"

#Convert company_name to a factor. 
linkedin2$company_name <- as.factor(linkedin2$company_name)

datatable(linkedin2, options=list(pageLength=10))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

Now, we are ready to explore. Let's see if there is any correlation between beauty and the work metrics.

#Randomly choosing 70% of the data. 
sample_selection <- createDataPartition(linkedin2$beauty, p=0.70, list=FALSE)

#Splitting up 70% to go into our train. and 30% to go into our test. 
train = linkedin2[sample_selection, ]
test = linkedin2[-sample_selection, ]

#Linear model on our dependent variable (beauty) with independent variables (avg_pos_len, avg_prev_tenure_len, and tenure_len)
train_model <- lm(beauty ~ avg_pos_len + avg_prev_tenure_len + tenure_len, data = train)
  
summary(train_model)
## 
## Call:
## lm(formula = beauty ~ avg_pos_len + avg_prev_tenure_len + tenure_len, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.787  -8.249  -0.398   7.954  66.638 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5.999e+01  9.291e-02 645.665  < 2e-16 ***
## avg_pos_len         -7.965e-04  1.171e-04  -6.802 1.05e-11 ***
## avg_prev_tenure_len -1.460e-03  5.705e-05 -25.599  < 2e-16 ***
## tenure_len          -4.051e-04  8.103e-05  -5.000 5.76e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.59 on 43894 degrees of freedom
## Multiple R-squared:  0.02739,    Adjusted R-squared:  0.02732 
## F-statistic:   412 on 3 and 43894 DF,  p-value: < 2.2e-16
predictions <- train_model %>% predict(test)

#Let's plot it
ggplot(data = test, aes(x = predictions, y = beauty)) + 
  geom_point() + 
  geom_smooth(method = "lm")

R2 <- R2(predictions, test$beauty)
RMSE <- RMSE(predictions, test$beauty)
MAE <- MAE(predictions, test$beauty)

R2
## [1] 0.03288204
RMSE
## [1] 11.60065
MAE
## [1] 9.407543

After this analysis, average position length, average previous tenure length, and tenure length all appear significant, since their p-values are below 0.05. However, statistical significance does not mean the predictors are practically useful: with tens of thousands of rows, even tiny effects produce small p-values, so this needs further investigation. If we look at the R-squared value, it is about 0.027, meaning only 2.7% of the variance in beauty is explained by the independent variables (avg_pos_len, avg_prev_tenure_len, and tenure_len). With an R-squared this low, we can conclude that the independent variables have little ability to predict a person's beauty score.
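
For reference, caret's `R2` here is the squared Pearson correlation between predictions and observations, `RMSE` is the root of the mean squared error, and `MAE` is the mean absolute error. A quick hand-check on toy vectors (made-up numbers):

```r
# Toy observed/predicted values (made up) showing what each metric computes.
obs  <- c(50, 55, 60, 65, 70)
pred <- c(52, 54, 61, 63, 72)

r2_manual   <- cor(pred, obs)^2            # squared Pearson correlation
rmse_manual <- sqrt(mean((pred - obs)^2))  # root mean squared error
mae_manual  <- mean(abs(pred - obs))       # mean absolute error
```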

Conclusion

In this segment of research, I explored how pay correlates with location, and found no significant ability to predict pay from location. This was due to the vast range of jobs and qualifications: one job may pay 100k where another pays 35k, and with such a large spread within each city, location alone cannot predict pay. A natural next step would be to predict pay from job type instead, i.e. investigate how much a person earns as a manager, software engineer, or consultant, for instance.

I have also found that it is pretty difficult to web scrape when the website is not in tabular form. On the job site I used, Seek (https://seek.com), many tags carried multiple classes or were wrapped in seemingly random span tags. Also, whenever the scraped vectors ended up with different lengths, it was difficult to combine them into one table.

In the latter segment, I explored whether beauty could be predicted from average position length, average previous tenure length, and tenure length. Statistically, all three independent variables were significant (p < 0.05), but when we took a deeper look at the R-squared, we discovered that only about 2.7% of the variance was explained by them.